Abstract: The paper evaluates the error performance of three
random finite set based multi-object trackers in the context of 
pedestrian video tracking. The evaluation is carried out using 
a publicly available video dataset of 4500 frames (town centre 
street) for which the ground truth is available. The input to 
all pedestrian tracking algorithms is an identical set of head 
and body detections, obtained using the Histogram of Oriented 
Gradients (HOG) detector. The tracking error is measured using 
the recently proposed OSPA metric for tracks, adopted as the 
only known mathematically rigorous metric for measuring the 
distance between two sets of tracks. A comparative analysis is 
presented under various conditions. 

I. Introduction 

Random set theory has recently been proposed as a mathematically
elegant framework for Bayesian multi-object filtering [1]. Research
within this theoretical framework has resulted in new multi-object
filtering algorithms, such as the probability hypothesis density
(PHD) filter [2], the Cardinalised PHD filter [3], and the
Multi-Bernoulli filter [1], [4]. The main feature of multi-object
filters is that they sequentially estimate the number of objects in
the surveillance volume of interest (the so-called cardinality) and
their states in the state space. Formulation of multi-object trackers
from random-set based multi-object filters has attracted a lot of
interest recently, see e.g. [5], [6], [7], [8]. The output of a
tracker is a set of tracks, that is, labeled temporal sequences of
state estimates associated with the same object.

In this paper we adopt the "tracking-by-detection" approach to
pedestrian tracking, which has become very popular in computer
vision due to its applicability to moving un-calibrated cameras [9],
[10]. Typically, pedestrian detections are obtained using the
Histogram of Oriented Gradients (HOG) detector [11], trained using
either head images (for head detections) or body images (for body
detections). An example of head and body detections is shown in
Fig. 1(a). Head and body detections are unreliable in the sense that:
(1) not all pedestrians are detected in every frame; (2) false
detections occur with non-negligible probability, and their spatial
density is typically non-uniform. This is evident in Fig. 1(a).

If the state vector of each object (pedestrian) contains the
position (e.g. the head centroid), but not the size of the object,
then both the head and body detections are instances of imprecise
measurements: they represent rectangular regions (two-dimensional
intervals) within which the true object is possibly located. As
such, they can be modelled as random closed sets (rather than random
variables). The first tracking algorithm considered in the paper is
designed specifically for imprecise measurements: it is a
multi-object tracker built from the Bernoulli filter described in
[12]. We will refer to this tracker as Algorithm 1.

An imprecise measurement (e.g. an interval in the measurement space)
can always be converted to a precise but random measurement (e.g. a
point in the measurement space which is affected by additive noise).
The two remaining algorithms considered in the paper assume precise
random detections of heads/bodies for pedestrian tracking. Algorithm
2 is the same as Algorithm 1, but uses the precise random (Gaussian)
measurement model. Algorithm 3 is based on the Cardinalised PHD
filter with data association [13].

Evaluation of multi-object tracking performance has been one of the
main stumbling blocks in advancing the scientific field of target
tracking. A large number of evaluation measures have been proposed,
both in the general context (e.g. [14], [15]) and specifically for
video surveillance (e.g. [16], [17], [18], [19]). At present there is
no consensus in sight on the preferred common approach. In this paper
tracking error will be measured using the recently formulated optimal
sub-pattern assignment (OSPA) metric for tracks (or OSPA-T) [20]. The
OSPA-T metric has an important advantage over all the above mentioned
performance measures: it is a mathematically rigorous metric (it
satisfies the axioms of a metric) for measuring the distance between
two sets of tracks (i.e. between the ground truth and the tracker
output). OSPA-T is also consistent with intuition, as discussed in
[21].

The remainder of the paper is organised as follows. Section II
describes the performance evaluation framework: the video dataset,
the method of pre-processing detections and the OSPA-T metric.
Section III reviews the three random-set based tracking algorithms.
Section IV presents the experimental results under various
conditions, while the conclusions of this study are drawn in
Section V.





Fig. 1. Frame 320 from the dataset used in performance evaluation: (a) Head and body detections (yellow and cyan rectangles, respectively); (b) tracker
output (the result of recursive processing of 79 previous frames and the current frame); track trajectories indicated by white lines; red lines are velocity
vectors; squares indicate the heads of pedestrians.



II. Performance evaluation framework 
A. Video dataset and detections 

The video dataset and the (hand labelled) ground truth were
downloaded from the website [22]. The dataset is a video of an
occasionally busy town centre street (the average number of
pedestrians visible at any time is 16). The video was recorded in
high definition (1920 × 1080 pixels at 25 fps). Only the first 4500
frames of the video are used in the performance evaluation. Frame
number 320 of the dataset is shown in Fig. 1.

Head detection and pedestrian body detection algorithms were applied
to each frame in the video sequence. The fastHOG GPU implementation
[23] of the Histogram of Oriented Gradients (HOG) detection
algorithm [11] was used for both detectors. The HOG detector applies
a sliding window to the image at a regular grid of locations and
scales, and classifies each sub-window as containing or not
containing an object (head or pedestrian). Classification is
performed using a pre-trained linear Support Vector Machine, the
input to which is a set of block-wise histograms of image gradient
orientation. A classification threshold of 0.75 was used for both
detectors.



Sliding window detectors tend to give multiple detections for 
one object due to their tolerance to shift and scale, so a post- 
processing step groups overlapping detections. 

The head and pedestrian (whole body) detections have some
complementary characteristics. The detector is only partially
tolerant to occlusions, so the head detector tends to have a higher
probability of detection, since heads are generally more visible
than whole bodies in surveillance video. However, pedestrian
textures are more distinctive than head textures, so the head
detector tends to have a higher false alarm rate, picking up on
round-ish background objects such as clocks and signs. The
pedestrian detector is better able to detect people at a distance,
where the head becomes too small in the image.

Head and body detections are treated as if they are indepen- 
dent. Each tracker can then be regarded as a centralised multi- 
source fusion node, where one source of detections is the head 
detector while the other is the body detector. The rectangles 
corresponding to body detections are converted to head-like 
detections as follows. Suppose a body detection rectangle is 
specified by its upper-left corner (xbt Vb), width Wb and height 
hb- Then for its corresponding head-like detection, the upper- 
left corner coordinates are computed as: x'h = Xb + 0.325 Wb 
and f]' h = r/b + 0.09 hf,, while the width and height are 
w' h = 0.35 Wb and h' h — 0.19 hb, respectively. 
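
The conversion above can be sketched in a few lines of Python. This is only an illustration: the `Rect` type and the function name `body_to_head` are hypothetical and do not come from the authors' MATLAB code, while the four scaling factors are the ones quoted in the text.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in pixels (hypothetical helper type)."""
    chi: float  # horizontal coordinate of the upper-left corner
    eta: float  # vertical coordinate of the upper-left corner
    w: float    # width
    h: float    # height


def body_to_head(body: Rect) -> Rect:
    """Map a body detection to a head-like detection.

    Uses the factors quoted in the text: offsets 0.325*w and 0.09*h,
    size 0.35*w by 0.19*h.
    """
    return Rect(
        chi=body.chi + 0.325 * body.w,
        eta=body.eta + 0.09 * body.h,
        w=0.35 * body.w,
        h=0.19 * body.h,
    )


# Example: an 80 x 200 pixel body rectangle with upper-left corner (100, 200).
head = body_to_head(Rect(chi=100.0, eta=200.0, w=80.0, h=200.0))
```

For this example the head-like rectangle has its upper-left corner near (126, 218) and measures about 28 by 38 pixels.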

B. OSPA-T metric 

Traditional multi-object tracking performance measures describe
various aspects of tracking performance, such as timeliness (e.g.
track initiation delay, track overshoot), track accuracy (e.g.
position, heading, velocity error), track continuity (e.g. track
fragmentation, track labelling swaps) and false tracks (their count
and duration). These measures are heuristic, and it is unclear how
to combine them into a single score because they are correlated.

OSPA-T [20] is defined as a theoretically rigorous distance measure
on the space of finite sets of tracks, and it has been proven that
it satisfies the axioms of a metric. The computation of OSPA-T is
described in Table I. Suppose we are given two sets of tracks, the
ground truth tracks $\{X^{(1)}, \dots, X^{(L)}\}$ and estimated
tracks $\{Y^{(1)}, \dots, Y^{(R)}\}$. A track $X^{(\ell)}$,
$\ell = 1, \dots, L$, is defined as a temporal sequence
$X^{(\ell)} = (X_1^{(\ell)}, \dots, X_K^{(\ell)})$ where each
$X_k^{(\ell)}$, $k = 1, \dots, K$, is either an empty set (if the
track does not exist at time $k$) or a singleton whose element is
$(\ell, \mathbf{x}_k)$. Here $\ell \in \mathbb{N}$ is the track label
and $\mathbf{x}_k$ is its state at time $k$. The labels of ground
truth tracks are by convention adopted to be $1, 2, \dots, L$.

The first step in the computation of OSPA-T is to label the
estimated tracks (steps 3, 4 and 5 in Table I). This first involves
finding the best assignment $\lambda^*$ of $R$ estimated tracks to
$L$ ground truth tracks. An assignment is a mapping
$\lambda^*(\ell) \in \{\emptyset\} \cup \{1, \dots, R\}$, for
$\ell = 1, \dots, L$, where $\emptyset$ indicates that no estimated
track is assigned to ground truth track $\ell$. The best assignment
is typically found using a two-dimensional assignment algorithm,
such as the auction or Munkres algorithm [24]. If for an estimated
track $r \in \{1, \dots, R\}$ there exists a true track $\ell$ such
that $\lambda^*(\ell) = r$, then track $r$ is assigned label $\ell$.
Estimated tracks which remain unassigned according to $\lambda^*$
are given labels different from all true track labels (i.e. integers
greater than $L$).
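
As a rough illustration of the labelling step only, the sketch below takes an assignment as given and produces the labels of the estimated tracks. The function name and the dictionary representation of the assignment are assumptions made for this example; how the best assignment itself is computed (the track-to-track cost and the auction or Munkres solver) is not covered here.

```python
def label_estimated_tracks(assignment, num_truth, num_estimated):
    """Return a dict {estimated track index r: label} for r = 1..R.

    assignment: dict mapping each ground truth label (1..L) to the index
                of the estimated track assigned to it, or to None when
                that ground truth track is left unassigned.
    """
    labels = {}
    # Assigned estimated tracks inherit the label of their truth track.
    for truth_label, r in assignment.items():
        if r is not None:
            labels[r] = truth_label
    # Unassigned estimated tracks get distinct labels greater than L.
    next_label = num_truth + 1
    for r in range(1, num_estimated + 1):
        if r not in labels:
            labels[r] = next_label
            next_label += 1
    return labels


# Three truth tracks, three estimated tracks; truth track 2 is unassigned.
labels = label_estimated_tracks({1: 2, 2: None, 3: 1},
                                num_truth=3, num_estimated=3)
```

In this example estimated track 2 receives label 1, estimated track 1 receives label 3, and the unassigned estimated track 3 receives the new label 4.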

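Then, for each time step $k = 1, \dots, K$, the OSPA distance
between the two labeled sets:

$$\mathcal{X}_k = \{(\ell_1, \mathbf{x}_{k,1}), \dots, (\ell_m, \mathbf{x}_{k,m})\} \qquad (1)$$

$$\mathcal{Y}_k = \{(s_1, \mathbf{y}_{k,1}), \dots, (s_n, \mathbf{y}_{k,n})\} \qquad (2)$$

is computed. The set $\mathcal{X}_k$ represents the set of existing
ground truth labeled track states at time $k$; similarly
$\mathcal{Y}_k$ is the set of existing estimated labeled track
states at time $k$. The OSPA distance between these two labeled sets
is computed as [20]:

$$D_{p,c}(\mathcal{X}_k, \mathcal{Y}_k) = \left[ \frac{1}{n} \left( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} \left( d_c(\tilde{\mathbf{x}}_{k,i}, \tilde{\mathbf{y}}_{k,\pi(i)}) \right)^p + (n - m) \cdot c^p \right) \right]^{1/p} \qquad (3)$$

where $\tilde{\mathbf{x}}_{k,i} = (\ell_i, \mathbf{x}_{k,i})$,
$\tilde{\mathbf{y}}_{k,\pi(i)} = (s_{\pi(i)}, \mathbf{y}_{k,\pi(i)})$ and

• $d_c(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) = \min(c, d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}))$ is the cut-off distance
between two tracks at $t_k$, with $c > 0$ being the cut-off
parameter;

• $d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}})$ is the base distance between two tracks at $t_k$;

• $\Pi_n$ represents the set of permutations of length $m$ with
elements taken from $\{1, 2, \dots, n\}$;

• $p \in [1, \infty)$ is the OSPA metric order parameter.

For the case $m > n$, the definition is
$D_{p,c}(\mathcal{X}_k, \mathcal{Y}_k) = D_{p,c}(\mathcal{Y}_k, \mathcal{X}_k)$.
If both $\mathcal{X}_k$ and $\mathcal{Y}_k$ are empty sets (i.e.
$m = n = 0$), the distance is zero.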
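
A minimal sketch of the OSPA distance (3) is given below. It enumerates all permutations by brute force, which is practical only for very small sets; a two-dimensional assignment algorithm would replace the enumeration in practice. The function name `ospa` and the convention of passing the base distance as a callable are choices made for this example, and the usage lines use a plain scalar distance in place of the labeled base distance.

```python
from itertools import permutations


def ospa(X, Y, base_distance, c, p):
    """OSPA distance of order p with cut-off c between lists X and Y.

    base_distance(x, y) is assumed to be symmetric and non-negative.
    """
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m > n:
        # Definition for m > n: swap the roles of the two sets.
        return ospa(Y, X, base_distance, c, p)
    # m <= n: try every way of pairing the m elements of X with
    # m distinct elements of Y and keep the cheapest pairing.
    best = min(
        sum(min(c, base_distance(x, Y[j])) ** p for x, j in zip(X, perm))
        for perm in permutations(range(n), m)
    )
    # Each of the n - m unpaired elements of Y is penalised by c^p.
    return ((best + (n - m) * c ** p) / n) ** (1.0 / p)


# Scalar toy example with |x - y| as the base distance.
result = ospa([0.0, 10.0], [1.0, 12.0, 50.0],
              base_distance=lambda a, b: abs(a - b), c=100.0, p=1.0)
```

In the toy example the cheapest pairing costs 1 + 2 = 3, one element of the larger set is unpaired, so the result is (3 + 100) / 3.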

TABLE I

Computation Steps of OSPA-T Metric

1: function OSPA-T($\{X^{(1)}, \dots, X^{(L)}\}$, $\{Y^{(1)}, \dots, Y^{(R)}\}$)
2:   % Label the estimated tracks
3:   For $j = 1, \dots, R$, Label[$Y^{(j)}$] = $I$ (where $I > L$)
4:   Find $\lambda^*$, the globally best assignment of tracks
     $\{X^{(1)}, \dots, X^{(L)}\}$ to $\{Y^{(1)}, \dots, Y^{(R)}\}$
5:   For $i = 1, \dots, L$, Label[$Y^{(\lambda^*(i))}$] = Label[$X^{(i)}$]
6:   % Compute the distance
7:   For $k = 1, \dots, K$
8:     Form the labeled sets at $t_k$:
9:     - Ground truth: $\mathcal{X}_k = \{(\ell_1, \mathbf{x}_{k,1}), \dots, (\ell_m, \mathbf{x}_{k,m})\}$
10:    - Estimated: $\mathcal{Y}_k = \{(s_1, \mathbf{y}_{k,1}), \dots, (s_n, \mathbf{y}_{k,n})\}$
11:    Compute the OSPA distance between $\mathcal{X}_k$ and $\mathcal{Y}_k$
12: end function



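The base distance $d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}})$ between
$\tilde{\mathbf{x}} = (\ell, \mathbf{x})$ and
$\tilde{\mathbf{y}} = (s, \mathbf{y})$ is defined as:

$$d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) = \left( d_\ell(\mathbf{x}, \mathbf{y})^{p'} + d_\alpha(\ell, s)^{p'} \right)^{1/p'} \qquad (4)$$

where: $p' \in [1, \infty)$ is the base distance order parameter;
$d_\ell(\mathbf{x}, \mathbf{y})$ is the localisation base distance,
typically adopted as the $p'$-norm:
$d_\ell(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_{p'}$;
$d_\alpha(\ell, s)$ is the labeling error, adopted as
$d_\alpha(\ell, s) = \alpha\, \bar{\delta}[\ell, s]$, where
$\bar{\delta}[i, j]$ is the complement of the Kronecker delta, that
is $\bar{\delta}[i, j] = 0$ if $i = j$, and $\bar{\delta}[i, j] = 1$
otherwise. Parameter $\alpha \in [0, c]$ here controls the penalty
assigned to the labeling error $d_\alpha(\ell, s)$ interpreted
relative to the localisation distance
$d_\ell(\mathbf{x}, \mathbf{y})$. The case $\alpha = 0$ assigns no
penalty, and $\alpha = c$ assigns the maximum penalty.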
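
The base distance (4) is short enough to write out directly. The sketch below assumes that a labeled state is represented as a pair (label, position tuple); this representation and the function name are illustrative choices, not part of the original formulation.

```python
def base_distance(x_labeled, y_labeled, alpha, p_prime):
    """Base distance (4) between two labeled states (label, position)."""
    (l, x), (s, y) = x_labeled, y_labeled
    # Localisation error: p'-norm of the position difference.
    loc = sum(abs(a - b) ** p_prime for a, b in zip(x, y)) ** (1.0 / p_prime)
    # Labeling error: alpha if the labels differ, zero otherwise.
    lab = 0.0 if l == s else alpha
    return (loc ** p_prime + lab ** p_prime) ** (1.0 / p_prime)


# With p' = 1 and alpha = 75, a label mismatch simply adds 75
# to the localisation error.
d_mismatch = base_distance((1, (10.0, 20.0)), (2, (13.0, 24.0)), 75.0, 1.0)
d_match = base_distance((1, (10.0, 20.0)), (1, (13.0, 24.0)), 75.0, 1.0)
```

Here the 1-norm localisation error is 3 + 4 = 7, so the two distances are about 82 and 7.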

Since in this paper we consider a sequence of a large number of
frames ($K = 4500$), the OSPA-T is applied over non-overlapping
segments (blocks) of frames.$^1$

C. Base distance is a metric 

The base distance $d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}})$,
defined in (4), satisfies the three axioms of a metric: identity,
symmetry and triangle inequality. Proving identity and symmetry is
trivial. The proof of the triangle inequality presented in
[20, Sec. III.A] is incorrect, and this section presents the correct
proof.

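Let $\tilde{\mathbf{x}} = (\ell, \mathbf{x})$,
$\tilde{\mathbf{y}} = (s, \mathbf{y})$,
$\tilde{\mathbf{z}} = (u, \mathbf{z})$. The following proof for the
triangle inequality is given in [20, Sec. III.A]:

$$d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}})^p \le d(\tilde{\mathbf{x}}, \tilde{\mathbf{z}})^p + d(\tilde{\mathbf{z}}, \tilde{\mathbf{y}})^p \qquad (5)$$

where in Sec. II-B and [20] the notation $p'$ was used instead of
$p$. Inequality (5) is incorrect, as can be seen for example by
adopting: $p = 2$, $\tilde{\mathbf{x}} = (1, 0)$,
$\tilde{\mathbf{y}} = (1, 5)$, $\tilde{\mathbf{z}} = (1, 4.99)$. Then
$d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}})^p = 25$ and
$d(\tilde{\mathbf{x}}, \tilde{\mathbf{z}})^p + d(\tilde{\mathbf{z}}, \tilde{\mathbf{y}})^p \approx 24.90$.
Moreover, (5) does not prove the triangle inequality.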
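
The counterexample can be checked numerically. The short sketch below uses scalar states and a hypothetical helper `d`; since all three labels are equal, the labeling term vanishes and the value chosen for alpha plays no role.

```python
def d(x_labeled, y_labeled, p, alpha=1.0):
    """Base distance for labeled scalar states (label, value)."""
    (l, a), (s, b) = x_labeled, y_labeled
    lab = 0.0 if l == s else alpha
    return (abs(a - b) ** p + lab ** p) ** (1.0 / p)


p = 2
x, y, z = (1, 0.0), (1, 5.0), (1, 4.99)

lhs = d(x, y, p) ** p                    # 5^2 = 25
rhs = d(x, z, p) ** p + d(z, y, p) ** p  # 4.99^2 + 0.01^2, about 24.9002
violates_eq5 = lhs > rhs

# The triangle inequality itself holds here (with equality, since the
# three points are collinear), so a small tolerance absorbs rounding.
triangle_holds = d(x, y, p) <= d(x, z, p) + d(z, y, p) + 1e-12
```

The left-hand side of (5) is 25 while the right-hand side is about 24.9002, so (5) fails for this triple even though the triangle inequality is satisfied.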
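We want to prove that

$$d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) \le d(\tilde{\mathbf{x}}, \tilde{\mathbf{z}}) + d(\tilde{\mathbf{z}}, \tilde{\mathbf{y}}) \qquad (6)$$

where according to (4)

$$d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}})^p = d_\ell(\mathbf{x}, \mathbf{y})^p + \alpha^p \bar{\delta}[\ell, s]. \qquad (7)$$

As $d_\ell(\cdot, \cdot)$ is a metric, it meets the triangle inequality

$$d_\ell(\mathbf{x}, \mathbf{y}) \le d_\ell(\mathbf{x}, \mathbf{z}) + d_\ell(\mathbf{z}, \mathbf{y}). \qquad (8)$$

As both sides of the inequality are non-negative numbers and $p \ge 1$,

$$d_\ell(\mathbf{x}, \mathbf{y})^p \le \left( d_\ell(\mathbf{x}, \mathbf{z}) + d_\ell(\mathbf{z}, \mathbf{y}) \right)^p. \qquad (9)$$

We also have that

$$\alpha \bar{\delta}[\ell, s] \le \alpha \bar{\delta}[\ell, u] + \alpha \bar{\delta}[u, s]. \qquad (10)$$

As both sides of inequality (10) are non-negative and $p \ge 1$,

$$\left( \alpha \bar{\delta}[\ell, s] \right)^p \le \left( \alpha \bar{\delta}[\ell, u] + \alpha \bar{\delta}[u, s] \right)^p. \qquad (11)$$

Using (9) and (11),

$$d_\ell(\mathbf{x}, \mathbf{y})^p + \left( \alpha \bar{\delta}[\ell, s] \right)^p \le \left( d_\ell(\mathbf{x}, \mathbf{z}) + d_\ell(\mathbf{z}, \mathbf{y}) \right)^p + \left( \alpha \bar{\delta}[\ell, u] + \alpha \bar{\delta}[u, s] \right)^p \qquad (12)$$

that is

$$\sqrt[p]{d_\ell(\mathbf{x}, \mathbf{y})^p + \left( \alpha \bar{\delta}[\ell, s] \right)^p} \le \sqrt[p]{\left( d_\ell(\mathbf{x}, \mathbf{z}) + d_\ell(\mathbf{z}, \mathbf{y}) \right)^p + \left( \alpha \bar{\delta}[\ell, u] + \alpha \bar{\delta}[u, s] \right)^p}. \qquad (13)$$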

1 The MATLAB source code for computation of OSPA-T metric, including 
the head and body detections for running and comparing different tracking 
algorithms, can be obtained upon request from the first author. 



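As $p \ge 1$, using the Minkowski inequality [25] on the right hand
side of (13),

$$\sqrt[p]{\left( d_\ell(\mathbf{x}, \mathbf{z}) + d_\ell(\mathbf{z}, \mathbf{y}) \right)^p + \left( \alpha \bar{\delta}[\ell, u] + \alpha \bar{\delta}[u, s] \right)^p} \le \sqrt[p]{d_\ell(\mathbf{x}, \mathbf{z})^p + \alpha^p \bar{\delta}[\ell, u]} + \sqrt[p]{d_\ell(\mathbf{z}, \mathbf{y})^p + \alpha^p \bar{\delta}[u, s]}. \qquad (14)$$

Finally, using (13) and (14), we get

$$\sqrt[p]{d_\ell(\mathbf{x}, \mathbf{y})^p + \alpha^p \bar{\delta}[\ell, s]} \le \sqrt[p]{d_\ell(\mathbf{x}, \mathbf{z})^p + \alpha^p \bar{\delta}[\ell, u]} + \sqrt[p]{d_\ell(\mathbf{z}, \mathbf{y})^p + \alpha^p \bar{\delta}[u, s]}. \qquad (15)$$

The proof is finished using (7):

$$d(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) \le d(\tilde{\mathbf{x}}, \tilde{\mathbf{z}}) + d(\tilde{\mathbf{z}}, \tilde{\mathbf{y}}). \qquad (16)$$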
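
As a sanity check that complements the proof (and is in no way a substitute for it), the triangle inequality can be tested on randomly drawn labeled states. The sampling ranges, the set of orders p and the helper name below are arbitrary choices made for this sketch.

```python
import random


def base_distance(xl, yl, alpha, p):
    """Base distance between labeled 2-D states (label, (px, py))."""
    (l, x), (s, y) = xl, yl
    loc = sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)  # p-norm
    lab = 0.0 if l == s else alpha
    return (loc ** p + lab ** p) ** (1.0 / p)


rng = random.Random(0)  # fixed seed so that the run is repeatable
worst_gap = float("-inf")
for _ in range(5000):
    p = rng.choice([1.0, 1.5, 2.0, 3.0])
    alpha = rng.uniform(0.0, 100.0)
    x, y, z = [
        (rng.randint(1, 3), (rng.uniform(-50, 50), rng.uniform(-50, 50)))
        for _ in range(3)
    ]
    # Positive gap would mean d(x, y) > d(x, z) + d(z, y).
    gap = base_distance(x, y, alpha, p) - (
        base_distance(x, z, alpha, p) + base_distance(z, y, alpha, p)
    )
    worst_gap = max(worst_gap, gap)

# Allow a tiny tolerance for floating point rounding.
triangle_ok = worst_gap <= 1e-9
```

No violation is expected in such a run, in agreement with (16).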

III. Description of Algorithms 

The state vector of a single object is adopted for all algorithms as
$\mathbf{x} = [x \;\; \dot{x} \;\; y \;\; \dot{y}]^T$, where $(x, y)$
is the position (in pixels) of the pedestrian head centroid and
$(\dot{x}, \dot{y})$ is its velocity vector (in pixels/s). The number
of objects varies from frame to frame. The random finite set of head
detections at frame $k$ is denoted $Z_k^{(1)}$. Accordingly, the
random set of head-like body detections (see the explanation in the
last sentence of Sec. II-A) is denoted $Z_k^{(2)}$.

Algorithms 1 and 2 are based on the multi-sensor Bernoulli filter
[26], where the "sensors" are the two types of pedestrian head
detections. Separate and independent Bernoulli filters are run for
each target. Target interactions are taken care of by the
appropriately increased clutter intensity, as in [27]. This
multi-object tracking algorithm has been described in some detail in
[28]. The difference between Algorithms 1 and 2 is in the model of
the single-object likelihood function. Let $\zeta \in Z_k^{(i)}$, for
$i = 1, 2$, be a detection resulting from an object (i.e. a
pedestrian head) in the state $\mathbf{x}$. A head detection is a
rectangle, thus $\zeta$ is specified by a tuple
$(\chi, \eta, w, h)$, where $(\chi, \eta)$ determines its upper-left
corner, while $w$ and $h$ are the width and height, respectively.

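The single-object likelihood function used in Algorithm 1 treats the
detection $\zeta$ as an imprecise measurement and is defined as in
[12]:

$$g_k^{(i)}(\zeta | \mathbf{x}) = \varphi(\mathbf{H}\mathbf{x}; \underline{\zeta}, \Sigma^{(i)}) - \varphi(\mathbf{H}\mathbf{x}; \overline{\zeta}, \Sigma^{(i)}) \qquad (17)$$

where $\varphi(\mathbf{z}; \mu, \Sigma)$ is the Gaussian cumulative
distribution function with mean $\mu$ and covariance $\Sigma$;
$\Sigma^{(i)} = \mathrm{diag}[\sigma_x^{(i)}, \sigma_y^{(i)}]$;
$\underline{\zeta}$ and $\overline{\zeta}$ are the lower and upper
bound of the rectangle, and

$$\mathbf{H} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$

If $\sigma_x^{(i)} = \sigma_y^{(i)} = 0$, then (17) simply states
that $g_k^{(i)}(\zeta | \mathbf{x}) = 1$ if $(x, y)$ is inside the
rectangle $\zeta$, and zero otherwise. The algorithm is applied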
to the video dataset using cr^ = 1 and Ux = 25. 
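
The following sketch illustrates the behaviour of an interval likelihood of this kind. It makes two assumptions that go beyond the text: the two-dimensional term is evaluated as a product of per-axis differences of one-dimensional Gaussian cumulative distribution functions, and the sigma parameters are treated as standard deviations. It is therefore an approximation in the spirit of (17), not necessarily the exact form used in [12].

```python
import math


def gauss_cdf(z, mean, sigma):
    """1-D Gaussian CDF; degenerates to a unit step when sigma is zero."""
    if sigma == 0.0:
        return 1.0 if z >= mean else 0.0
    return 0.5 * (1.0 + math.erf((z - mean) / (sigma * math.sqrt(2.0))))


def interval_likelihood(pos, rect, sigma_x, sigma_y):
    """Likelihood of a rectangular detection given head position pos.

    pos:  (x, y), i.e. the position part Hx of the state.
    rect: (chi, eta, w, h), upper-left corner, width and height.
    Assumption: per-axis product of CDF differences.
    """
    x, y = pos
    chi, eta, w, h = rect
    gx = gauss_cdf(x, chi, sigma_x) - gauss_cdf(x, chi + w, sigma_x)
    gy = gauss_cdf(y, eta, sigma_y) - gauss_cdf(y, eta + h, sigma_y)
    return gx * gy


rect = (0.0, 0.0, 10.0, 10.0)
inside = interval_likelihood((5.0, 5.0), rect, 0.0, 0.0)    # inside the box
outside = interval_likelihood((15.0, 5.0), rect, 0.0, 0.0)  # outside the box
soft = interval_likelihood((10.5, 5.0), rect, 1.0, 1.0)     # just outside
```

With zero sigma the function reduces to the indicator of the rectangle, as stated in the text; with a positive sigma a point just outside the rectangle still receives a non-zero likelihood.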

Algorithms 2 and 3 first convert the rectangular detection $\zeta$
into a point measurement
$\mathbf{z} = [\chi + w/2, \; \eta + h/2]^T$, with the associated
covariance matrix $\mathbf{R} = \mathrm{diag}[(w/6)^2, (h/6)^2]$.
Then the single-object likelihood function of $\mathbf{z}$ is adopted
as:

$$g_k^{(i)}(\mathbf{z} | \mathbf{x}) = \mathcal{N}(\mathbf{z}; \mathbf{H}\mathbf{x}, \mathbf{R}) \qquad (18)$$

where $\mathcal{N}(\mathbf{m}; \mu, \mathbf{P})$ is the Gaussian
probability density function with mean $\mu$ and covariance
$\mathbf{P}$.
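
The conversion to a point measurement and the Gaussian likelihood (18) can be sketched as follows. The function names are hypothetical, and the covariance is stored as its two diagonal entries because R is diagonal.

```python
import math


def rect_to_point(rect):
    """Convert a rectangle (chi, eta, w, h) to (z, R_diag).

    z is the rectangle centre; R_diag holds the diagonal of
    R = diag[(w/6)^2, (h/6)^2].
    """
    chi, eta, w, h = rect
    z = (chi + w / 2.0, eta + h / 2.0)
    r_diag = ((w / 6.0) ** 2, (h / 6.0) ** 2)
    return z, r_diag


def gaussian_likelihood(z, pos, r_diag):
    """Evaluate N(z; Hx, R) for diagonal R, where pos = Hx = (x, y)."""
    density = 1.0
    for zi, mi, var in zip(z, pos, r_diag):
        density *= math.exp(-0.5 * (zi - mi) ** 2 / var)
        density /= math.sqrt(2.0 * math.pi * var)
    return density


z, r_diag = rect_to_point((100.0, 200.0, 30.0, 36.0))
peak = gaussian_likelihood(z, z, r_diag)            # head exactly at the centre
off = gaussian_likelihood(z, (120.0, 218.0), r_diag)  # head 5 pixels to the right
```

For a 30 by 36 pixel rectangle the standard deviations are 5 and 6 pixels, so the whole rectangle spans three standard deviations either side of its centre along each axis.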

Algorithm 3 is based on the Cardinalised PHD (CPHD) filter [3], but
with additional logic to deal with track labeling. The key idea of
[13] is to form clusters of targets, and to apply the CPHD filter
update to each cluster separately. The update uses every available
detection (measurement) to calculate the weight of the
track-to-measurement association. The weight of the no-measurement
association is also computed. These weights are then used to form an
association matrix, which is solved using a two-dimensional
assignment algorithm (e.g. auction, Munkres). Finally, each
predicted track is updated with the measurement which has been
assigned to it by the assignment algorithm. Since we have at our
disposal two types of detections ($Z_k^{(1)}$ are head detections,
and $Z_k^{(2)}$ are head-like body detections), the update step in
Algorithm 3 is applied twice, first using $Z_k^{(1)}$ and then using
$Z_k^{(2)}$. Although this is not an optimal approach [29], it has
been suggested as a reasonable approximation.

TABLE II

A Summary of the Contesting Tracking Algorithms

Alg.   Likelihood function   Method
1.     Eq. (17)              Multi-Bernoulli Tracker of [28]
2.     Eq. (18)              Multi-Bernoulli Tracker of [28]
3.     Eq. (18)              CPHD based tracker [13]



All three algorithms used the same clutter maps (one map for heads,
the other for body detections). The probability of detection was set
to $p_D^{(1)} = 0.58$ and $p_D^{(2)} = 0.52$. A short summary of the
algorithms is given in Table II.

IV. Numerical results 

The localisation base distance of the OSPA-T error,
$d_\ell(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_{p'}$,
only takes into account the positional error (i.e. it neglects the
velocity error). Fig. 2 shows the resulting OSPA-T error for the
three random-set based tracking algorithms. The parameters of the
OSPA-T metric used in the evaluation are: $p = p' = 1$, $c = 100$ and
$\alpha = 75$. Identical head detections and body-to-head converted
detections, from every frame, have been used by all three
algorithms. Fig. 2 also shows, as a guideline, the OSPA-T error of
the Benfold-Reid (BR) algorithm [10], whose tracking results are
available online [22]. We point out that the comparison between the
BR algorithm and the three random-set based trackers is not fair,
because the BR algorithm is a smoother (it operates as a batch
algorithm over a sliding window of image frames) and does not use
body/head detections in every frame. From Fig. 2 one can observe
that the ranking of the algorithms according to OSPA-T varies with
time. For example, from frame number 800 to 1100, the BR algorithm
is far superior to the random-set based trackers, but the opposite
is true from frame 1400 to 1600. In order to obtain an overall
ranking, the time averaged OSPA-T error has been computed: its value
for Algorithms 1, 2, 3 and the BR algorithm is 45.2, 42.8, 40.7 and
40.4, respectively.



The conclusion is that the most accurate of the three random- 
set tracking algorithms is Algorithm 3. Furthermore, it appears 
that the imprecise measurement model is not justified in the 
adopted context: the transformation of head and body-to-head 
rectangles (imprecise detections) into random precise mea- 
surement points provides better tracking results. This can be 
explained by the nature of head and body-to-head rectangular 
detections; it has been observed that if a detection is not false, 
then its rectangular centre is a very accurate estimate of the 
centre of a pedestrian head. Thus the likelihood (17), which
is based on the interpretation that the true head centroid is 
somewhere inside the rectangle, appears to be too cautious 
and consequently does not use the full information content of 
a measurement. 

We repeated the OSPA-T error computations for $\alpha = 0$ (no
penalty for the labeling error). This case corresponds to the
original OSPA error proposed in [21]. The time averaged OSPA-T
errors obtained for Algorithms 1, 2, 3 and the Benfold-Reid
algorithm in this case were 34.1, 29.5, 27.4, and 30.2,
respectively. Again Algorithm 3 performs the best among the
random-set based trackers, and even outperforms the Benfold-Reid
algorithm. This result reveals that the major problem with Algorithm
3 is the lack of track consistency (too many broken tracks), which
is not penalised when $\alpha = 0$ is adopted. Track consistency can
be improved by smoothing over multiple image frames (to be
considered in future work).

Head and body detection algorithms are very computationally
intensive, and consequently in real-time applications it may not be
possible to run them on every image frame. Next we compare the
OSPA-T error performance of Algorithm 3 for the situations where
head and body detections are available for: (1) every frame, (2)
every 2nd frame, (3) every 4th frame and (4) every 8th frame. The
results are shown in Fig. 3. We note that the error performance does
not change dramatically with the reduced frequency of head and body
detections. The time averaged OSPA-T errors for the four cases are:
40.7, 37.9, 39.2, and 42.7. Somewhat surprisingly, using body/head
detections every 2nd and every 4th frame reduces the number of false
tracks and overall improves the accuracy. Only when body/head
detections become available every 8th frame do some of the true
tracks start to go missing occasionally, and consequently the OSPA-T
error performance deteriorates.

V. Conclusions 

The paper presented a framework for performance evaluation of
multi-object trackers. The framework is illustrated in the context
of video tracking by comparison of three random-set based pedestrian
tracking algorithms, using a video dataset of a busy town centre.
The multi-object tracking error was evaluated using the "OSPA for
tracks" (OSPA-T) metric. The OSPA-T metric has the important
property that it satisfies the axioms of a metric. The mathematical
proof of the triangle inequality axiom is presented in the paper.

Fig. 2. Comparison of tracking algorithms: OSPA-T error for the sequence of 4500 frames; OSPA-T metric parameters: p = p' = 1, c = 100 and α = 75

The results of performance evaluation indicate that the CPHD based
tracker of [13] performs the best. Although this is a single-frame
recursive algorithm, its performance is comparable to that of [10]
(which operates over a sliding window of frames). Future work will
consider a smoothing version of the algorithm in [13], since a delay
of a few frames in reporting the tracks is tolerable and has the
potential to further improve the tracking accuracy.